library(tidyverse)
library(knitr)
A paper, Decreasing human body
temperature in the United States since the Industrial Revolution,
presented evidence that human body temperatures in the United States
have been decreasing over the past one hundred or so years. (Many
scientists dispute the conclusions of the paper.) One of the data sets
in the paper is taken from the NHANES (National Health and Nutrition
Examination Study), and is available in the file
NHANES_processed.csv. There are many variables in the data,
but our focus will be on the variable temp that provides
resting oral body temperatures.
nhanes <- read.csv("NHANES_processed.csv",header = TRUE)
names <- read.csv("CensusNames.csv")
nhanes
Temperture <- nhanes$temp
hist(Temperture)
The empirical cumulative distribution function (ecdf) gives the proportion of data values at or below a particular value. In the case of the body temperature data, \(\text{ecdf}(98.1)\), for example, gives the proportion of temperatures in the data set below \(98.1\) degrees. A few minutes of thought reveals that
stat_ecdf function in ggplot2 to
draw an “empirical cumulative distribution function” for the body
temperature variable.ggplot(nhanes,aes(Temperture))+
stat_ecdf(geom = "step")
Look at the plot of the ecdf to gain more insight into the behavior of the ecdf.
geom_vline() function with an
xintercept argument.ggplot(nhanes,aes(Temperture)) +
stat_ecdf(geom = "step") +
geom_vline(xintercept=98) + geom_vline(xintercept=98.4) + geom_vline(xintercept=98.6)
How do these compare with the percentiles computed via the
summary() function?
summary(nhanes)
X V1 study_ID sample_weights
Min. : 1 Min. :Inf Length:15301 Min. : 0
1st Qu.: 6467 1st Qu.:Inf Class :character 1st Qu.: 1288
Median :12884 Median :Inf Mode :character Median : 4752
Mean :12641 Mean :Inf Mean : 7536
3rd Qu.:19117 3rd Qu.:Inf 3rd Qu.: 10381
Max. :23808 Max. :Inf Max. :100990
temp time_HR race sex
Min. : 95.20 Min. : 0.00 Length:15301 Length:15301
1st Qu.: 98.00 1st Qu.:10.93 Class :character Class :character
Median : 98.40 Median :14.67 Mode :character Mode :character
Mean : 98.31 Mean :14.42
3rd Qu.: 98.60 3rd Qu.:16.57
Max. :101.80 Max. :23.97
NA's :8769
age year_of_birth exam_date exam_year
Min. :21.00 Min. :1896 Length:15301 Min. :1971
1st Qu.:31.00 1st Qu.:1908 Class :character 1st Qu.:1972
Median :45.00 Median :1928 Mode :character Median :1973
Mean :46.55 Mean :1926 Mean :1973
3rd Qu.:64.00 3rd Qu.:1941 3rd Qu.:1974
Max. :75.00 Max. :1953 Max. :1975
exammonth exam_findings exam_ICD exam_ICD2
Min. : 1.000 Min. :0.000 Min. : 9 Min. : 11
1st Qu.: 4.000 1st Qu.:2.000 1st Qu.: 528 1st Qu.:9999
Median : 7.000 Median :2.000 Median :9999 Median :9999
Mean : 6.506 Mean :1.626 Mean :6148 Mean :8273
3rd Qu.: 9.000 3rd Qu.:2.000 3rd Qu.:9999 3rd Qu.:9999
Max. :12.000 Max. :8.000 Max. :9999 Max. :9999
region birth_cohort head_eyes_ears_nose_findings
Length:15301 Length:15301 Mode :logical
Class :character Class :character FALSE:4456
Mode :character Mode :character TRUE :10845
thyroid_findings chest_findings cardiovascular_findings abdominal_findings
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:340 FALSE:3868 FALSE:2374 FALSE:6255
TRUE :14961 TRUE :11433 TRUE :12927 TRUE :9046
musculosceletal_findings neurological_findings skin_findings
Mode :logical Mode :logical Mode :logical
FALSE:2209 FALSE:1770 FALSE:2143
TRUE :13092 TRUE :13531 TRUE :13158
general_findings no_findings thyroid weight_KG
Mode :logical Mode :logical Min. :0.000 Min. : 32.32
FALSE:3046 FALSE:12136 1st Qu.:1.000 1st Qu.: 58.97
TRUE :12255 TRUE :3165 Median :1.000 Median : 68.49
Mean :1.078 Mean : 70.44
3rd Qu.:1.000 3rd Qu.: 79.72
Max. :8.000 Max. :181.44
height_CM BMI temp_C bmi_adj
Min. :132.8 Min. :12.41 Min. :35.11 Min. :-12.8240
1st Qu.:159.6 1st Qu.:21.84 1st Qu.:36.67 1st Qu.: -2.2594
Median :165.6 Median :24.68 Median :36.89 Median : 0.5705
Mean :166.3 Mean :25.42 Mean :36.84 Mean : 1.3745
3rd Qu.:172.6 3rd Qu.:28.00 3rd Qu.:37.00 3rd Qu.: 3.9716
Max. :205.1 Max. :72.26 Max. :38.78 Max. : 48.9415
height_norm weight_norm
Min. :-22.252 Min. :-26.59393
1st Qu.: 4.458 1st Qu.: 0.05607
Median : 10.458 Median : 9.57607
Mean : 11.204 Mean : 11.52316
3rd Qu.: 17.458 3rd Qu.: 20.80607
Max. : 49.978 Max. :122.52607
The 20th,50th,75th line up pretty good compared to the plots vs summary
They are both very close 98.40 and 98.31 close, tells us that the distribution of data is very close in range
The file CensusNames.csv contains data on surnames in
the United States, including the name, the rank of the name among all
names, and the number of people in the United States with that name.
(Some uncommon names are not included in the data.)
name <- names %>%
filter(name == "DOVER") %>%
group_by(count)
ggplot(name, aes(x=count)) +
geom_histogram()
ggplot(name,aes(count))+
stat_ecdf(geom = "step")
They dint really give much information to draw a conclusion about the information
log10() computes base 10
logarithms.ggplot(name, aes(x=log10(count))) +
geom_histogram()
ggplot(name,aes(log10(count)))+
stat_ecdf(geom = "step")
summary() function. Are they
similar?summary(name$count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9742 9742 9742 9742 9742 9742
Maybe my plots are wrong but they dont really comapre
The numbers are all the same.
This section is similar to the previous one.
The file UnitedStatesCities.csv contains data on cities
in the United States, including the name of the city, the population
rank of the city among all cities, and the number of people living in
the city in 2010.
city <- read.csv("UnitedStatesCities.csv")
city
city %>%
filter(City == "Houston city, Texas") %>%
select(Rank, Population2010)
ggplot(city, aes(x = Population2010)) +
geom_histogram(
binwidth = 10000,
fill = "steelblue",
color = "black"
) +
labs(
title = "Histogram of U.S. City Populations",
x = "Population",
y = "Count"
) +
theme_minimal()
ggplot(city, aes(x = Population2010)) +
stat_ecdf(geom = "step") +
labs(
title = "ECDF of U.S. City Populations",
x = "Population",
y = "ECDF"
) +
theme_minimal()
city <- city %>%
mutate(logpop = log10(Population2010))
# Histogram of log10(population)
ggplot(city, aes(x = logpop)) +
geom_histogram(binwidth = 0.1, fill = "darkcyan", color = "black") +
labs(title = "Histogram of log10(Population)",
x = "log10(Population)",
y = "Count") +
theme_minimal()
# ECDF of log10(population)
ggplot(city, aes(x = logpop)) +
stat_ecdf(geom = "step") +
labs(title = "ECDF of log10(Population)",
x = "log10(Population)",
y = "ECDF") +
theme_minimal()
summary() function. Are
they similar?summary(city$logpop)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.567 3.059 3.138 3.660 6.912
# Extract exactly the 1st, 2nd, and 3rd quartiles via quantile()
quantile(city$logpop, probs = c(0.25, 0.5, 0.75))
25% 50% 75%
2.567026 3.059185 3.660439